Word Segmentation for Chinese Novels
Abstract
Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but the accuracies drop significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining for the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
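The general idea of double propagation described above, alternating between mining contextual patterns from known entities and mining new entities from known patterns, can be sketched as follows. This is only a toy illustration under stated assumptions, not the authors' algorithm: the function name `double_propagation`, the single-character left/right context as a "pattern", the frequency threshold `min_count`, and the example sentences are all hypothetical choices made for brevity.

```python
from collections import Counter


def double_propagation(sentences, seed_entities, max_len=3, min_count=2, rounds=5):
    """Toy bootstrapping loop (hypothetical simplification).

    A "pattern" here is just the pair (character before, character after)
    surrounding an entity occurrence in an unsegmented sentence.
    """
    entities = set(seed_entities)
    patterns = set()
    for _ in range(rounds):
        # Step 1: entities -> patterns. Count the contexts of known entities.
        pattern_counts = Counter()
        for sent in sentences:
            for ent in entities:
                start = sent.find(ent)
                while start != -1:
                    end = start + len(ent)
                    if start > 0 and end < len(sent):
                        pattern_counts[(sent[start - 1], sent[end])] += 1
                    start = sent.find(ent, start + 1)
        new_patterns = {p for p, c in pattern_counts.items() if c >= min_count}

        # Step 2: patterns -> entities. Collect spans that fill a known context.
        known = patterns | new_patterns
        entity_counts = Counter()
        for sent in sentences:
            for i in range(1, len(sent)):
                for n in range(1, max_len + 1):
                    j = i + n
                    if j < len(sent) and (sent[i - 1], sent[j]) in known:
                        entity_counts[sent[i:j]] += 1
        new_entities = {e for e, c in entity_counts.items() if c >= min_count}

        # Stop once neither set grows any further.
        if new_patterns <= patterns and new_entities <= entities:
            break
        patterns |= new_patterns
        entities |= new_entities
    return entities, patterns


# Hypothetical toy corpus: one seed name, one unseen name in the same context.
sentences = ["见张三说", "见张三说", "见李四说", "见李四说"]
entities, patterns = double_propagation(sentences, {"张三"})
print(sorted(entities), sorted(patterns))
```

On this toy corpus the seed 张三 yields the context pattern (见, 说), which in turn picks up 李四 as a new entity. A real system would need richer patterns and scoring to filter noisy candidates; the mined entities would then be supplied to the segmenter as features rather than used to retrain it, as the abstract describes.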
Similar resources
Specification for Segmentation and Named Entity Annotation of Chinese Classics in the Ming and Qing Dynasties
The quality of text segmentation and annotation plays a significant role in Natural Language Processing especially in downstream applications. This paper presents the specification for word segmentation and named entity annotation targeted for novels in the Ming and Qing dynasties. The purpose of this work is to build the foundational work for computer-aided lexical semantic analysis of classic...
Adaptive Chinese Word Segmentation with Online Passive-Aggressive Algorithm
In this paper, we describe our system for the CIPS-SIGHAN-2010 bake-off task of Chinese word segmentation, which focused on the cross-domain performance of Chinese word segmentation algorithms. We use the online passive-aggressive algorithm with domain-invariant information for cross-domain Chinese word segmentation.
The CIPS-SIGHAN CLP 2010 Chinese Word Segmentation Bakeoff
The CIPS-SIGHAN CLP 2010 Chinese Word Segmentation Bakeoff was held in the summer of 2010 to evaluate the current state of the art in word segmentation. It focused on the cross-domain performance of Chinese word segmentation algorithms. Eighteen groups submitted 128 results over two tracks (open training and closed training), four domains (literature, computer science, medicine and finance) and ...
The CIPS-SIGHAN CLP 2014 Chinese Word Segmentation Bake-off
This paper summarizes several aspects of the SIGHAN 2014 Chinese Word Segmentation bakeoff, such as the dataset and the evaluation results. In addition, we analyze segmentation errors by instance and make a suggestion for improving segmentation systems.
1 Goal of the Chinese word segmentation bake-off
Chinese Word Segmentation is the preliminary step for Chinese information processing, which is extremel...
Chinese Word Segmentation Based On Direct Maximum Entropy Model
Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develops a Chinese lexical analyzer, PCWS, using a direct maximum entropy model. The paper presents a general description of PCWS, as well as the results and analysis of its performance at the Second International Chinese Wor...